---
title: "Operator"
icon: "computer-mouse"
---

the operator api allows for powerful desktop automation using accessibility roles and elements. it provides a robust way to interact with applications programmatically.

<Note>
  compatibility note:

   some functions of the operator api currently works on macos only. on windows/linux, please use the pixel api

   [windows in progress](https://github.com/mediar-ai/screenpipe/pull/1694)

   instead.
</Note>

to understand roles better, open the macos accessibility inspector and examine the roles for any application.

[feel free to use our docs as context in cursor agent through MCP](https://docs.screenpi.pe/mcp-server#mintlify-mcp-use-our-docs-in-cursor%2C-claude%2C-etc)

## installation

<Tabs>
  <Tab title="npm">
    ```bash
    npm install @screenpipe/browser
    ```
  </Tab>
  <Tab title="pnpm">
    ```bash
    pnpm add @screenpipe/browser
    ```
  </Tab>
  <Tab title="bun">
    ```bash
    bun add @screenpipe/browser
    ```
  </Tab>
  <Tab title="yarn">
    ```bash
    yarn add @screenpipe/browser
    ```
  </Tab>
</Tabs>

this also works in node.js using `@screenpipe/js`

## basic usage

```typescript
import { pipe } from '@screenpipe/browser'

async function simpleAutomation() {
  try {
    // open an application
    await pipe.operator.openApplication("Chrome")
    
    // navigate to a website
    await pipe.operator.openUrl("https://github.com/mediar-ai/screenpipe")
    
    // find search fields by role
    const searchFields = await pipe.operator
      .getByRole("searchfield", {
        app: "Chrome",
        activateApp: true
      })
      .all(3)
    
    if (searchFields.length > 0) {
      // fill the search field
      await pipe.operator
        .getById(searchFields[0].id, { app: "Chrome" })
        .fill("automation")
      
      // click a button
      const buttons = await pipe.operator
        .getByRole("button", { app: "Chrome" })
        .all(5)
      
      if (buttons.length > 0) {
        await pipe.operator
          .getById(buttons[0].id, { app: "Chrome" })
          .click()
      }
      
      // scroll page content
      await pipe.operator
        .getById(searchFields[0].id, { app: "Chrome" })
        .scroll("down", 300)
    }
  } catch (error) {
    console.error("automation failed:", error)
  }
}
```

## core methods

the operator api provides a set of intuitive methods for automating desktop interactions:

| method                          | description                          | compatibility     | example                                                 |
| ------------------------------- | ------------------------------------ | ----------------- | ------------------------------------------------------- |
| `openApplication(name)`         | launches an application              | macOS only        | `pipe.operator.openApplication("Chrome")`               |
| `openUrl(url, browser?)`        | opens a url in a browser             | macOS only        | `pipe.operator.openUrl("github.com")`                   |
| `getByRole(role, options)`      | finds elements by accessibility role | macOS only        | `pipe.operator.getByRole("button", {app: "Chrome"})`    |
| `getById(id, options)`          | gets element by id                   | macOS only        | `pipe.operator.getById("element-123", {app: "Chrome"})` |
| `.click()`                      | clicks an element                    | macOS only        | `pipe.operator.getById(id).click()`                     |
| `.fill(text)`                   | enters text in a field               | macOS only        | `pipe.operator.getById(id).fill("hello")`               |
| `.scroll(direction, amount)`    | scrolls an element                   | macOS only        | `pipe.operator.getById(id).scroll("down", 300)`         |
| `pixel.type(text)`              | types text                           | all platforms     | `pipe.operator.pixel.type("hello world")`                        |
| `pixel.press(key)`              | presses a keyboard key               | all platforms     | `pipe.operator.pixel.press("enter")`                             |
| `pixel.moveMouse(x, y)`         | moves mouse cursor to position       | all platforms     | `pipe.operator.pixel.moveMouse(100, 200)`                        |
| `pixel.click(button)`           | clicks mouse button                  | all platforms     | `pipe.operator.pixel.click("left")`                              |

### pixel api

pixel API is a higher level API that is useful for:
- controlling your iPhone through iPhone mirroring (because you cannot parse the screen of your iPhone)
- Windows and Linux which does not support yet the functions like `openApplication`, `getByRole`, `click`, etc.

```typescript
// type text
await pipe.operator.pixel.type("hello world")

// press key
await pipe.operator.pixel.press("enter")

// move mouse
await pipe.operator.pixel.moveMouse(100, 200)

// click
await pipe.operator.pixel.click("left") // "left" | "right" | "middle"
```

## common accessibility roles

to understand better roles, feel free to open MacOS Accessibility Inspector and see the roles for any application.

when using `getByRole()`, you'll need to specify the accessibility role. here are common ones:

- `"button"` - clickable buttons
- `"textfield"` - text input fields
- `"searchfield"` - search input fields
- `"checkbox"` - checkbox elements
- `"radiobutton"` - radio button elements
- `"combobox"` - dropdown menus
- `"link"` - hyperlinks
- `"image"` - images
- `"statictext"` - text labels
- `"scrollarea"` - scrollable containers

## advanced usage examples

### automating form filling

```typescript
async function fillContactForm() {
  // open the app and navigate to form
  await pipe.operator.openApplication("Chrome")
  await pipe.operator.openUrl("https://example.com/contact")
  
  // wait for page to load (simple delay)
  await new Promise(resolve => setTimeout(resolve, 2000))
  
  // find form fields
  const nameField = await pipe.operator
    .getByRole("textfield", { 
      app: "Chrome",
      activateApp: true
    })
    .first()
  
  const emailField = await pipe.operator
    .getByRole("textfield", { 
      app: "Chrome"
    })
    .first()
  
  const messageField = await pipe.operator
    .getByRole("textfield", { 
      app: "Chrome"
    })
    .first()
  
  // fill the form
  await pipe.operator
    .getById(nameField.id, { app: "Chrome" })
    .fill("john doe")
  
  await pipe.operator
    .getById(emailField.id, { app: "Chrome" })
    .fill("john@example.com")
  
  await pipe.operator
    .getById(messageField.id, { app: "Chrome" })
    .fill("this is an automated message from screenpipe!")
  
  // find and click submit button
  const submitButton = await pipe.operator
    .getByRole("button", { 
      app: "Chrome"
    })
    .first()
  
  await pipe.operator
    .getById(submitButton.id, { app: "Chrome" })
    .click()
}
```

### automating app workflows

```typescript
async function processImages() {
  // open photoshop
  await pipe.operator.openApplication("Adobe Photoshop")
  await new Promise(resolve => setTimeout(resolve, 3000)) // wait for app to launch
  
  // open file
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  // navigate file picker (simplified)
  // in practice, you'd need more complex logic to navigate the file picker
  
  // apply filter
  await pipe.operator
    .getByRole("menuitem", { 
      app: "Adobe Photoshop"
    })
    .first()
    .then(menu => pipe.operator.getById(menu.id, { app: "Adobe Photoshop" }).click())
  
  // and so on...
}
```

## ai-powered automation

for more powerful automation, combine the operator api with vercel ai sdk to enable ai-driven desktop interactions:

```typescript
import { useState } from "react"
import { streamText, convertToCoreMessages } from "ai"
import { createOpenAI } from "@ai-sdk/openai"
import { pipe } from "@screenpipe/browser"
import { z } from "zod"

export function AIAutomationAgent() {
  const [input, setInput] = useState("")
  const [output, setOutput] = useState("")
  
  const handleSubmit = async (e) => {
    e.preventDefault()
    setOutput("")
    
    const model = createOpenAI({
      apiKey: process.env.OPENAI_API_KEY
    })("gpt-4o")
    
    const result = streamText({
      model,
      messages: convertToCoreMessages([{
        role: "user",
        content: input
      }]),
      system: "you are a desktop automation assistant. help users by performing actions on their computer.",
      tools: {
        openApplication: {
          description: "open an application",
          parameters: z.object({
            appName: z.string().describe("the name of the application to open")
          }),
          execute: async ({ appName }) => {
            const success = await pipe.operator.openApplication(appName)
            return success
              ? `opened ${appName} successfully`
              : `failed to open ${appName}`
          }
        },
        
        openUrl: {
          description: "open a url in a browser",
          parameters: z.object({
            url: z.string().describe("the url to open"),
            browser: z.string().optional().describe("the browser to use")
          }),
          execute: async ({ url, browser }) => {
            const success = await pipe.operator.openUrl(url, browser)
            return success
              ? `opened ${url} in ${browser || "default browser"}`
              : `failed to open ${url}`
          }
        },
        
        findAndClick: {
          description: "find an element by role and click it",
          parameters: z.object({
            app: z.string().describe("the application name"),
            role: z.string().describe("the accessibility role of the element")
          }),
          execute: async ({ app, role }) => {
            try {
              const elements = await pipe.operator
                .getByRole(role, { app, activateApp: true })
                .all(5)
                
              if (elements.length > 0) {
                await pipe.operator
                  .getById(elements[0].id, { app })
                  .click()
                return `clicked ${role} element in ${app}`
              }
              return `no ${role} elements found in ${app}`
            } catch (error) {
              return `error: ${error.message}`
            }
          }
        },
        
        fillText: {
          description: "fill text in a form field",
          parameters: z.object({
            app: z.string().describe("the application name"),
            role: z.string().describe("the role of the element to fill"),
            text: z.string().describe("the text to enter")
          }),
          execute: async ({ app, role, text }) => {
            try {
              const elements = await pipe.operator
                .getByRole(role, { app, activateApp: true })
                .all(5)
                
              if (elements.length > 0) {
                await pipe.operator
                  .getById(elements[0].id, { app })
                  .fill(text)
                return `filled text in ${role} element in ${app}`
              }
              return `no ${role} elements found in ${app}`
            } catch (error) {
              return `error: ${error.message}`
            }
          }
        }
      },
      toolCallStreaming: true,
      maxSteps: 5
    })
    
    for await (const chunk of result.textStream) {
      setOutput(prev => prev + chunk)
    }
  }
  
  return (
    <div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="e.g., 'open chrome and go to github'"
        />
        <button type="submit">run</button>
      </form>
      <div>{output}</div>
    </div>
  )
}
```

for a complete implementation with automatic tool selection, see the [hello-world-computer-use](https://github.com/mediar-ai/screenpipe/blob/main/pipes/hello-world-computer-use/components/chat.tsx) example pipe.

## troubleshooting

if you're having issues with the operator api:

1. **macos permissions**: ensure screenpipe has accessibility permissions in system settings \> privacy & security \> accessibility
2. **app names**: use exact app names as they appear in the applications folder
3. **timing issues**: add delays between operations, as ui elements may take time to load
4. **debugging**: log element ids and roles to help identify the right elements
5. **app focus**: use the `activateApp: true` option to ensure the target app is in focus

for more detailed debugging, use the macos accessibility inspector to identify exact roles and properties of ui elements.

## practical use cases

here are some real-world applications for the operator api:

- **messaging automation**
  - scrape whatsapp conversations and export them to spreadsheets
  - auto-respond to common imessage inquiries when you're busy
  - track response rates across different messaging platforms

- **social media management**
  - schedule and post content across multiple platforms
  - collect engagement metrics from twitter, instagram, or linkedin
  - automate following/unfollowing based on specific criteria
  - export comments and replies for sentiment analysis

- **data collection and research**
  - extract data from websites that don't have accessible apis
  - compile information across multiple applications into a single report
  - monitor prices or availability of products across different sites
  - build comprehensive research databases from scattered sources

- **personal productivity**
  - automate repetitive daily tasks (checking emails, organizing files)
  - create custom workflows between applications that don't normally integrate
  - set up intelligent reminders based on content of messages or emails
  - auto-fill forms with personal or business information

- **customer relationship management**
  - track conversations across multiple platforms for each contact
  - automatically update crm systems with new interaction data
  - generate follow-up reminders based on conversation content
  - build comprehensive customer profiles from scattered data sources

- **content creation and editing**
  - automate screenshots or recordings of specific application states
  - batch process images or documents using desktop applications
  - extract text from images or pdfs for further processing
  - organize and tag media files based on their content

these automation ideas become even more powerful when combined with ai for intelligent decision-making based on the content being processed.

examples: 

- [scrap any desktop app into a spreadsheet](https://github.com/mediar-ai/screenpipe/blob/main/pipes/desktop-to-table)
- [using operator api through chat (not very accurate)](https://github.com/mediar-ai/screenpipe/blob/main/pipes/hello-world-computer-use/components/chat.tsx)
